Lecture 10: Confusion matrices and model fit for linear regression
EBA3500 Data analysis with programming
Jonas Moss
BI Norwegian Business School Department of Data Science and Analytics
Confusion matrices, ROC, and AUC
\(0-1\) predictions in binary regression
The predict method gives you a probability.
But often you have to make a prediction of \(\hat{Y}\) itself.
Will you follow up on a marketing lead? Do you believe a candidate will get an \(A\) when he drinks 5 cups of coffee?
You usually construct predictions of \(Y\) using a threshold: \[\hat{Y}=1 \iff \hat{p} > c,\] where \(\hat{p}\) is a predicted probability obtained using predict and \(\hat{Y}\) is the predicted value of \(Y\).
For instance, you could decide to do the action if \(\hat{p} > 0.5\).
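As a minimal sketch, thresholding works like this (the probabilities below are made up, not output from a fitted model):

```python
import numpy as np

p_hat = np.array([0.2, 0.55, 0.8, 0.4])  # hypothetical predicted probabilities
c = 0.5                                  # threshold
y_hat = (p_hat > c).astype(int)          # predict 1 exactly when p_hat > c
print(y_hat)  # [0 1 1 0]
```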
Admission data (again)
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

admission = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admission.head()
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4
model = smf.logit("admit ~ gre + gpa + C(rank)", data = admission).fit()
plt.hist(model.predict())
Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
The confusion matrix shows true positives, false negatives, false positives, and true negatives.
|         | \(\hat{Y}=1\)  | \(\hat{Y}=0\)  |
|---------|----------------|----------------|
| \(Y=1\) | True positive  | False negative |
| \(Y=0\) | False positive | True negative  |
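Counting the four cells by hand makes the definitions concrete; a small sketch with made-up labels and predictions:

```python
import numpy as np

y = np.array([1, 1, 0, 0, 1, 0])      # true outcomes (made up)
y_hat = np.array([1, 0, 0, 1, 1, 0])  # thresholded predictions (made up)

tp = int(np.sum((y == 1) & (y_hat == 1)))  # true positives
fn = int(np.sum((y == 1) & (y_hat == 0)))  # false negatives
fp = int(np.sum((y == 0) & (y_hat == 1)))  # false positives
tn = int(np.sum((y == 0) & (y_hat == 0)))  # true negatives
print(tp, fn, fp, tn)  # 2 1 1 2
```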
Confusion matrix for admission with c = 0.5
model.pred_table(threshold = 0.5)

array([[254.,  19.],
       [ 97.,  30.]])

|         | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---------|---------------|---------------|
| \(Y=1\) | 30            | 97            |
| \(Y=0\) | 19            | 254           |
Question: How many true positives are there?
Answer: True positives happen when both \(\hat{Y}=1\) and \(Y=1\). Hence the correct answer is \(30\).
Question: How many false negatives are there?
Answer: False negatives correspond to \(\hat{Y}=0\) and \(Y=1\). There are 97 of these.
Confusion matrix for admission
model.pred_table(threshold = 0.6)

array([[266.,   7.],
       [114.,  13.]])

|         | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---------|---------------|---------------|
| \(Y=1\) | 13            | 114           |
| \(Y=0\) | 7             | 266           |

model.pred_table(threshold = 0.5)

array([[254.,  19.],
       [ 97.,  30.]])

|         | \(\hat{Y}=1\) | \(\hat{Y}=0\) |
|---------|---------------|---------------|
| \(Y=1\) | 30            | 97            |
| \(Y=0\) | 19            | 254           |
Confusion matrix for admission
Question: How can you make sure the number of false positives is \(0\)?
Answer: Make the threshold \(c=1\)!
model.pred_table(threshold = 1)
array([[273., 0.],
[127., 0.]])
True positive rate and false positive rate
\[
\text{true positive rate}=\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\] Also known as sensitivity, recall, and hit rate. Abbreviated as tpr.
\[
\text{false positive rate}=\frac{\text{false positives}}{\text{false positives} + \text{true negatives}}
\] Also known as the fall-out; equal to one minus the specificity. Abbreviated as fpr.
Here \(TPR = \frac{30}{30 + 97} \approx 0.24\) and \(FPR = \frac{19}{19 + 254} \approx 0.07\).
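Using the counts from the \(c = 0.5\) confusion matrix above:

```python
tp, fn, fp, tn = 30, 97, 19, 254  # counts from the c = 0.5 confusion matrix

tpr = tp / (tp + fn)  # 30 / 127
fpr = fp / (fp + tn)  # 19 / 273
print(round(tpr, 2), round(fpr, 2))  # 0.24 0.07
```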
Receiver operating characteristic curve
The receiver operating characteristic curve. Source: Wikipedia.
How to make these plots
from sklearn import metrics

fpr, tpr, _ = metrics.roc_curve(admission.admit, model.predict())
plt.plot(fpr, tpr)
plt.show()
The area under the curve (i)
The area under the curve (ii)
The area under the curve (AUC) is the integral of the receiver operating characteristic curve.
\(0.5 \leq AUC \leq 1\).
\(AUC = 1\) if we can predict perfectly.
\(AUC = 0.5\) if we can’t predict at all.
Extremely widespread!
Since the AUC doesn’t penalize model complexity it shouldn’t be used to compare models!
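The AUC also has a probability interpretation: it equals the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one. A sketch with toy scores (not from the admission model; ties count as one half):

```python
import numpy as np

y = np.array([0, 0, 1, 1, 0, 1])               # toy labels
p = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # toy predicted probabilities

pos = p[y == 1]
neg = p[y == 0]
# fraction of (positive, negative) pairs that are ranked correctly
auc = np.mean([(a > b) + 0.5 * (a == b) for a in pos for b in neg])
print(auc)  # 8/9 ≈ 0.889
```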
The \(R^2\) for binary regression models
Remember the setup for the \(R^{2}\) for non-linear regression: \[
R^2 = 1-\frac{\sum_{i=1}^{n}(y_{i}-f(x_{i};a_{1},\ldots,a_{p}))^{2}}{\sum_{i=1}^{n}(y_{i}-m)^{2}}.\] In general we could write \[R^{2}=1-\frac{R(x)}{R(m)},\] where
\(R(m)\) measures the distance between the best constant prediction \(m\) and the data \(y\), and
\(R(x)\) measures the distance between our best predictions and the data \(y\) when we take \(x\) into account.
The log-likelihood and the \(R^2\)
We use the log-likelihood instead of the quadratic loss! So, letting \(\hat{p}_i = f(x_{i};a_{1},\ldots,a_{p})\), define the fitted (negative) log-likelihood \[l(x) = R(x) = \sum_{i=1}^{n}\left[-y_{i}\log \hat{p}_{i}-(1-y_{i})\log(1-\hat{p}_{i})\right].\]
\[l(m) = R(m) = \sum_{i=1}^{n}\left[-y_{i}\log m-(1-y_{i})\log(1-m)\right]\] is the corresponding quantity for the null model, where \(m\) is the best constant prediction. It measures how well we can predict \(y\) when we don’t know any \(x\) at all.
McFadden’s \(R^2\)
Now define the pseudo-\(R^2\) or McFadden \(R^2\) as \[R^2_{\textrm{McFadden}} = 1 - \frac{l(x)}{l(m)}.\]
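A numeric sketch with hypothetical log-likelihoods (chosen to be roughly in the ballpark of the admission logit fitted below):

```python
ll_model = -229.26  # hypothetical log-likelihood of the fitted model
ll_null = -249.99   # hypothetical log-likelihood of the intercept-only model

r2_mcfadden = 1 - ll_model / ll_null
print(round(r2_mcfadden, 4))  # ≈ 0.0829
```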
Since McFadden’s \(R^2\) doesn’t penalize model complexity it shouldn’t be used to compare models!
An example
probit_fit = smf.probit("admit ~ gre + C(rank) + gpa", data = admission).fit()
logit_fit = smf.logit("admit ~ gre + C(rank) + gpa", data = admission).fit()

def rsq_mcfadden(fit):
    lower = fit.llnull
    upper = fit.llf
    return 1 - upper / lower

print(rsq_mcfadden(probit_fit), rsq_mcfadden(logit_fit))
Optimization terminated successfully.
Current function value: 0.573016
Iterations 5
Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
0.08313059666890721 0.08292194470084713
Thus the \(R^2\)s are \(0.0831\) and \(0.0829\). But we may also access McFadden’s \(R^2\) directly: print(probit_fit.prsquared, logit_fit.prsquared).
Question: Can you compare these \(R^2\)s to the \(R^2\)s from linear regression?
Answer: No, as we are looking at likelihoods now, not the quadratic distance. Also, the “ordinary” \(R^2\) tends to be small for binary data.
McFadden’s \(R^2\) vs the ordinary \(R^2\)
The following picture, Figure 5 from Chapter 5 of McFadden’s Urban Travel Demand: A Behavioral Analysis (1996), illustrates the typical relationship between the least squares \(R^2\) (on the \(y\)-axis) and the pseudo-\(R^2\) (on the \(x\)-axis).
McFadden’s comparison of the ordinary \(R^2\) and McFadden’s \(R^2\). Source: McFadden, D. Urban Travel Demand: A Behavioral Analysis (1996)
Variable selection in linear regression
Akaike’s information criterion (recap)
Akaike’s information criterion (AIC) is defined as \[\text{AIC} = 2p - 2l,\] where \(p\) is the number of estimated parameters and \(l\) is the log-likelihood at the estimated parameters.
The smaller the AIC the better!
We add \(2p\) since a larger number of parameters allows for more noise. It penalizes model complexity.
Penalization of model complexity is always needed when comparing models with different numbers of parameters!
There is a theoretical argument for the choice \(2p\), but it is beyond the scope of this course.
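The definition translates directly into code; the log-likelihood and parameter count below are hypothetical values for illustration:

```python
def aic(log_lik, n_params):
    # Akaike's information criterion: 2p - 2l
    return 2 * n_params - 2 * log_lik

# hypothetical log-likelihood and parameter count
print(aic(-229.26, 6))  # 470.52
```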
The adjusted \(R^2\)
The \(R^2\) is good for evaluating how well we can predict the outcome given our covariates, but it’s not good for choosing between models.
It doesn’t correct for the bias that occurs when using the same data both to estimate the model parameters and to evaluate model fit.
import numpy as np

rng = np.random.default_rng(seed = 313)
p = 10
n = 100
x = rng.normal(0, 1, (n, p))  # think about y ~ 1 + x1 + x2 + ... + x_p
y = rng.normal(3, 2, n)       # true R^2 is 0!
The true \(R^2\) for y ~ x is \(0\) in this model. However, the expected value of the estimated \(R^2\) is not.
Can show that \[R^2 \sim \textrm{Beta}\left(\frac{k-1}{2},\frac{n-k}{2}\right),\] where \(k\) is the number of parameters including the intercept. (That is, \(k = p + 1\).)
Using the properties of the Beta distribution, we find that the expected value of \(R^2\) is \[E(R^2) = \frac{\frac{k-1}{2}}{\frac{k-1}{2} + \frac{n-k}{2}} = \frac{k-1}{n - 1}.\]
This number might be quite large when the number of predictors is close to the number of observations!
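Plugging in the simulation’s dimensions (\(p = 10\) covariates plus an intercept, so \(k = 11\), with \(n = 100\)):

```python
k, n = 11, 100                    # parameters including the intercept, observations
expected_rsq = (k - 1) / (n - 1)  # E(R^2) when the true R^2 is 0
print(round(expected_rsq, 3))     # 0.101
```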
import statsmodels.formula.api as smf

n_reps = 5000
yy = rng.normal(3, 2, (n_reps, n))

def func1d(y):
    fit = smf.ols("y ~ x", data = {"y": y, "x": x}).fit()  # y ~ x1 + x2 + ... + x_p
    return fit.rsquared

rsqs = np.apply_along_axis(func1d = func1d, axis = 1, arr = yy)
[np.mean(rsqs), p / (n - 1)]
[0.10158243551749126, 0.10101010101010101]
Constructing the adjusted \(R^2\) (i)
Recall the definition of the \(R^2\): \[R^2 = 1 - \frac{\textrm{Sum of squares with predictors}}{\textrm{Sum of squares without predictors}}\]
We can show that \(\textrm{Sum of squares with predictors}\) is biased for its population value, the true sum of squares with predictors at our estimated regression coefficients. But we can correct for this bias!
One can show that \[\frac{n}{n-p-1}E(\textrm{Sum of squares with predictors})\] equals the true population sum of squares with predictors. Here \(p\) is the number of estimated regression coefficients, excluding the intercept.
Constructing the adjusted \(R^2\) (ii)
Moreover, we can show that \[\frac{n}{n-1} E(\textrm{Sum of squares without predictors})\] equals the true, population value of the sum of squares without predictors.
It follows that a reasonable corrected \(R^2\) is \[R_a^2 = 1 - \frac{\frac{n}{n-p-1} \textrm{Sum of squares with predictors}}{\frac{n}{n-1}\textrm{Sum of squares without predictors}}\]
Rearranging this, we find that \[R_a ^ 2 = 1 - (1 - R^2) \frac{n-1}{n-p-1} .\]
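The rearranged formula is easy to code, and it also shows that the adjustment can push the value below zero:

```python
def adjusted_rsq(rsq, n, p):
    # R_a^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - rsq) * (n - 1) / (n - p - 1)

# with a weak fit and many predictors the adjusted R^2 turns negative
print(adjusted_rsq(0.1, 100, 10))  # ≈ -0.0011
```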
Notes on the adjusted \(R^2\)
The adjusted \(R^2\), or \(R^2_a\), can be less than \(0\). Try to understand why.
We haven’t proved that the adjusted \(R^2\) is unbiased for the true \(R^2\). Do you think it is?
Can you devise a simulation study to explore this problem?
Comments
These methods can be used together with automatic mechanisms for selecting covariates, e.g. backwards and forwards regression.
The BIC (“Bayesian information criterion”) is similar to the AIC, but is defined as \(p\log n - 2l\). (The AIC is \(2p - 2l\).)
There are many other methods too, but the adjusted \(R^2\), ANOVA, and the AIC are the most popular.
Model fit
Conditions for inference in regression
The linear relationship is true: \(y_i=\beta^T x_i + \epsilon_i\).
No autocorrelation. The errors \(\epsilon_i\) are independent of each other given \(x_i\). (This is important.)
Normally distributed errors. The errors are normally distributed. (Not very important.)
Homoskedastic errors. The variance of the errors should not depend on \(x_i\).
Residuals
Defined as \(\hat{y} - y\). It’s common to look at plots of residuals.
We often plot “residuals vs. fitted”, i.e. \(\hat{y} - y\) vs \(\hat{y}\)
The residuals are smaller when the absolute value of the fitted observations is small!
You can make such plots when dealing with multiple covariates too.
QQ plot of residuals
Source: Introduction to Linear Regression Analysis (Elizabeth Peck, Geoffrey Vining)
QQ plot: About \(n\)
Small samples (\(n<20\)) often don’t yield useful quantile-quantile plots!
But \(n>30\) usually suffices.
QQ plot: \(n = 100\), normal distribution
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed = 313)
x = rng.normal(size = 100)
sm.qqplot(x, line = '45')
plt.show()
QQ plot: \(n = 10\), normal distribution
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed = 3)
x = rng.normal(size = 10)
sm.qqplot(x, line = '45')
plt.show()
rng = np.random.default_rng(seed = 3)
x = rng.uniform(size = 50, low = 0, high = 1)
sm.qqplot(x, line = '45')
plt.show()
Heteroskedasticity
import seaborn as sns

rng = np.random.default_rng(seed = 313)
x = rng.uniform(size = 50)
y = 1 + x + rng.normal(size = 50) * (x + 0.5) ** 2
data = pd.DataFrame({"x": x, "y": y})
sns.lmplot(data = data, x = "x", y = "y")
Robust standard errors using White (1980)
Inferential procedures such as confidence intervals and p-values are invalid under heteroskedasticity.
Use the argument cov_type='HC0' inside the fit method for robust standard errors.
smf.ols("y ~ x", data = data).fit().pvalues
Intercept 0.018321
x 0.035364
dtype: float64
smf.ols("y ~ x", data = data).fit(cov_type='HC0').pvalues
Intercept 0.002156
x 0.028677
dtype: float64
White, H. (1980). A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica, 48(4), 817–838. https://doi.org/10.2307/1912934
Robust regression
Telephone data
Number of international calls from Belgium, taken from the Belgian Statistical Survey, published by the Ministry of Economy.
import seaborn as sns
import rdatasets as rd

telef = rd.data("Robustbase", "telef")
sns.lmplot(data = telef, x = "Year", y = "Calls")
What’s happening here? Clerical errors! Such errors are abundant.
Two ways to fix such problems:
(a) Remove the “bad” data. But this is not always feasible.
(b) Use robust regression methods. Such methods are – sometimes – able to automatically ignore data that should be ignored.
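Huber’s loss, which underlies the HuberT norm used below, is quadratic for small residuals and grows only linearly for large ones; this is what lets the fit downweight outliers. A sketch (the tuning constant 1.345 matches the statsmodels HuberT default):

```python
import numpy as np

def huber(r, t = 1.345):
    # 0.5 * r^2 for |r| <= t, linear growth t * (|r| - t / 2) beyond
    r = np.abs(r)
    return np.where(r <= t, 0.5 * r ** 2, t * (r - 0.5 * t))

print(huber(0.5))   # quadratic region: 0.125
print(huber(10.0))  # linear region: 1.345 * (10 - 0.6725) ≈ 12.545
```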
Using rlm
robust = smf.rlm("Calls ~ Year", data = telef, M = sm.robust.norms.HuberT()).fit()

sns.lmplot(data = telef, x = "Year", y = "Calls")
plt.plot(telef["Year"], robust.predict(), color = "red")
robust = smf.rlm("prestige ~ income + education", data = prestige, M = sm.robust.norms.HuberT()).fit()
print(robust.summary())
Robust linear Model Regression Results
==============================================================================
Dep. Variable: prestige No. Observations: 45
Model: RLM Df Residuals: 42
Method: IRLS Df Model: 2
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Tue, 01 Nov 2022
Time: 10:36:22
No. Iterations: 18
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -7.1107 3.879 -1.833 0.067 -14.713 0.492
income 0.7015 0.109 6.456 0.000 0.489 0.914
education 0.4854 0.089 5.441 0.000 0.311 0.660
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
Collinear covariate matrices
Perfect collinearity
A set of covariates is perfectly collinear if you can write one of them as linear combination of the others.
If you have covariates \(x_1, x_2, x_3, \ldots, x_p\), they are perfectly collinear if \(x_1 = \sum_{i=2}^{p} \alpha_i x_i\) for some vector \(\alpha\).
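Numerically, perfect collinearity shows up as a rank-deficient design matrix. A small sketch with made-up columns:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
x3 = x1 + 2 * x2  # x3 is a linear combination of x1 and x2
X = np.column_stack([x1, x2, x3])

print(np.linalg.matrix_rank(X))  # 2, although X has 3 columns
```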
import statsmodels.api as sm

y = x(0.000000000000001)
z = y @ np.array([1, 2, 3, 4])
mod = sm.OLS(endog = z, exog = y).fit()
print(mod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 3.854e+30
Date: Tue, 01 Nov 2022 Prob (F-statistic): 2.60e-31
Time: 10:36:24 Log-Likelihood: 148.69
No. Observations: 5 AIC: -291.4
Df Residuals: 2 BIC: -292.6
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0000 9.98e-14 1e+13 0.000 1.000 1.000
x1 1.0000 4.66e-14 2.14e+13 0.000 1.000 1.000
x2 4.0000 1.73e-14 2.31e+14 0.000 4.000 4.000
x3 3.0000 2.94e-14 1.02e+14 0.000 3.000 3.000
==============================================================================
Omnibus: nan Durbin-Watson: 0.245
Prob(Omnibus): nan Jarque-Bera (JB): 0.824
Skew: 0.944 Prob(JB): 0.662
Kurtosis: 2.378 Cond. No. 1.30e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.56e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Bouncing betas!
import statsmodels.api as sm

y = x(0.00000001)
z = y @ np.array([1, 2, 3, 4])
mod = sm.OLS(endog = z, exog = y).fit()
print(mod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.130e+12
Date: Tue, 01 Nov 2022 Prob (F-statistic): 6.92e-07
Time: 10:36:24 Log-Likelihood: 44.752
No. Observations: 5 AIC: -81.50
Df Residuals: 1 BIC: -83.07
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0000 0.000 5118.979 0.000 0.998 1.002
x1 2.0000 2.08e+04 9.64e-05 1.000 -2.64e+05 2.64e+05
x2 3.0000 2.08e+04 0.000 1.000 -2.64e+05 2.64e+05
x3 4.0000 2.08e+04 0.000 1.000 -2.64e+05 2.64e+05
==============================================================================
Omnibus: nan Durbin-Watson: 0.227
Prob(Omnibus): nan Jarque-Bera (JB): 0.529
Skew: -0.587 Prob(JB): 0.767
Kurtosis: 1.922 Cond. No. 2.06e+10
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.81e-18. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Changing the perturbation from 0.000000000000001 to 0.00000001 changed three coefficient estimates dramatically!
This phenomenon is called bouncing betas.
Most common when using many categorical covariates.
Summary (i)
The ROC curve displays the relationship between the true positive rate and the false positive rate.
The AUC measures how good your model is at predicting, with \(0.5\) being no better than chance and \(1\) being perfect. The AUC should not be used to compare models. Use the AIC for that.
McFadden’s \(R^2\) is a generalization of the \(R^2\) to binary models.
Summary (ii)
You can handle outliers and clerical errors using robust regression models.
Residual plots can be used to find out if the residuals are heteroskedastic. If the residuals are heteroskedastic, you may use White (1980) corrected standard errors.
Quantile-quantile plots are used to check if residuals are normal. Be on the lookout for heavy tails.
Almost collinear covariate matrices make inference hard and can produce bouncing betas.
The adjusted \(R^2\) can be used for model selection.